Probability Box with Kernel Density Estimation

Weather forecasts got me thinking about data. A simple historical table — temperature, humidity, rain — and the question: given all this data, what can I actually say about the probability of different combinations of conditions occurring together?

That question led me to build what I started calling a “probability box” — a structure that turns a dataset into a queryable probability space. Once you have it, you can ask about cumulative probabilities: the probability that temperature is above X and humidity is below Y and rain is below Z, all at once.

There are various ways to build this. I went with Kernel Density Estimation because it makes minimal assumptions about the underlying distributions.

Let’s walk through an example.

library(ks)
library(data.table)
library(plotly)
# Generating Random Data with Three Variables
set.seed(1)
n <- 1000
temp <- rnorm(n, mean = 25, sd = 2.47)
hum <- rnorm(n, mean = 15, sd = 0.2)
rain <- rnorm(n, mean = 3.46, sd = 4.07)
xyz <- cbind(temp, hum, rain)

Three variables: temperature in Celsius, humidity as a percentage, rain in millimeters. Generated independently here, but the approach works the same with correlated real data.

Kernel Density Estimation (KDE) estimates the probability density function of the data non-parametrically — no assumption that the data follows a normal distribution or any other named family. It places a kernel (typically a Gaussian) at each data point and sums them up to get a smooth density estimate.
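To make that concrete, here is a small one-dimensional sketch in base R (my own illustration, separate from the walkthrough): build the estimate by hand as an average of Gaussians centered at the data points, then compare it against stats::density with the same bandwidth.

```r
# 1-D KDE by hand: average a Gaussian kernel centred at each data point.
set.seed(1)
x <- rnorm(50, mean = 25, sd = 2.5)
h <- bw.nrd0(x)  # rule-of-thumb bandwidth from base R

kde_manual <- function(x0, data, h) {
  # (1 / (n * h)) * sum of N(0,1) kernels evaluated at (x0 - xi) / h
  mean(dnorm((x0 - data) / h)) / h
}

grid <- seq(min(x), max(x), length.out = 200)
est_manual <- sapply(grid, kde_manual, data = x, h = h)

# stats::density implements the same idea (with FFT-based binning),
# so the two curves should agree closely on the same grid.
est_builtin <- density(x, bw = h, from = min(x), to = max(x), n = 200)$y
max(abs(est_manual - est_builtin))
```

ks::kde is the multivariate version of the same construction, with a bandwidth matrix H in place of the scalar h.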

To compute probabilities over regions of the joint space, we need to evaluate the KDE on a grid. The grid resolution determines how precisely you can answer probability queries — finer grids give more accurate estimates at the cost of computation time and memory.
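The trade-off is easy to see in one dimension, where the exact answer is available. A small sketch (illustration only, using a standard normal so pnorm supplies the truth): approximate P(X ≤ 1) by summing density values times the grid step, at a coarse and a fine resolution.

```r
# Riemann-sum probability from a density grid, at two resolutions.
exact <- pnorm(1)  # true P(X <= 1) for X ~ N(0, 1)

riemann <- function(step) {
  g <- seq(-6, 1, by = step)  # grid covering the region of interest
  sum(dnorm(g)) * step        # density values times cell width
}

abs(riemann(0.5) - exact)   # coarse grid: larger error
abs(riemann(0.01) - exact)  # fine grid: much smaller error
```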

# Setting up Grid for KDE
temp_min <- min(temp)
temp_max <- max(temp)
cx <- 0.1

hum_min <- min(hum)
hum_max <- max(hum)
cy <- 0.1

rain_min <- min(rain)
rain_max <- max(rain)
cz <- 0.1

pts.x <- seq(temp_min, temp_max, cx)
pts.y <- seq(hum_min, hum_max, cy)
pts.z <- seq(rain_min, rain_max, cz)

pts <- expand.grid(temp = pts.x, hum = pts.y, rain = pts.z)

Now run KDE on the grid.

# Performing KDE
f_kde <- kde(xyz, eval.points = pts)

Before doing anything with the estimates, it’s worth checking that they integrate to 1 over the full space — a basic sanity check that the density is properly normalized.

# Assigning KDE Estimates to Data Frame
pts$est <- f_kde$estimate

# Checking KDE Integration
integration_check <- sum(pts$est) * cx * cy * cz

# Displaying the result of the integration check
cat("Integration Check Result:", integration_check, "\n")
Integration Check Result: 0.9945768 

Close to 1. The small shortfall comes from the grid being clipped at the observed minimum and maximum of each variable: the KDE places some density slightly beyond the data range, and that tail mass falls outside the grid. Now for the probability queries. To find the cumulative probability over a region, filter the grid to that region and sum the densities multiplied by the grid cell volume. For example: the probability that temperature is above 17, humidity is below 20, and rain is below 5.

# Querying Cumulative Probability in a Specific Area

setDT(pts)
cumulative_prob_area <- pts[temp > 17 & hum < 20 & rain < 5, .(pkde = sum(est) * cx * cy * cz)]

cat("Cumulative Probability in Area:", cumulative_prob_area$pkde, "\n")
Cumulative Probability in Area: 0.6370082 
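A quick cross-check: because the data is synthetic, the KDE answer can be compared against the raw empirical proportion of observations in the same region. The two should be close, with the KDE figure a touch lower since roughly half a percent of the density mass fell outside the grid. Regenerating the same data:

```r
# Empirical proportion in the queried region, on the same synthetic data
set.seed(1)
n <- 1000
temp <- rnorm(n, mean = 25, sd = 2.47)
hum <- rnorm(n, mean = 15, sd = 0.2)
rain <- rnorm(n, mean = 3.46, sd = 4.07)

empirical_prob <- mean(temp > 17 & hum < 20 & rain < 5)
cat("Empirical proportion:", empirical_prob, "\n")
```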

Finding the most probable combination of values is just locating the grid point with the maximum density estimate (the mode of the joint distribution, up to grid resolution).

# Find the index of the maximum density
max_density_index <- which.max(f_kde$estimate)

# Extract the corresponding combination of values from the grid
print(pts[max_density_index, c("temp", "hum", "rain")], row.names = FALSE)
     temp      hum     rain
    <num>    <num>    <num>
 24.47012 14.94936 3.453885

One caveat worth mentioning: this approach treats the three variables as if their dependence structure can be captured entirely by the joint KDE. For the synthetic data here, where the variables were generated independently, that’s not a problem. With real data where the variables are correlated — especially in the tails — a copula-based approach would be more appropriate, because copulas explicitly model the dependence structure separately from the marginals.
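For intuition, here is a minimal hand-rolled Gaussian-copula sketch in base R. This is an illustration under strong assumptions, not production code: it takes the marginals to be the known normals used to generate the data, whereas with real data you would fit or empirically estimate them, and in practice you would likely reach for the copula package (pobs, fitCopula, pCopula) instead.

```r
# Gaussian copula by hand: model the dependence separately from the margins.
set.seed(1)
n <- 1000
temp <- rnorm(n, mean = 25, sd = 2.47)
hum <- rnorm(n, mean = 15, sd = 0.2)
rain <- rnorm(n, mean = 3.46, sd = 4.07)
xyz <- cbind(temp, hum, rain)

# 1. Rank-transform each margin to normal scores (the margins drop out here).
z <- qnorm(apply(xyz, 2, rank) / (n + 1))

# 2. The correlation matrix of the scores captures the dependence structure.
R <- cor(z)

# 3. Answer joint CDF queries by Monte Carlo from the fitted copula.
m <- 1e5
zs <- matrix(rnorm(m * 3), m, 3) %*% chol(R)  # rows ~ N(0, R)
us <- pnorm(zs)                               # copula sample on (0,1)^3

# P(temp <= 28, hum <= 15.2, rain <= 5): plug in marginal CDF values.
uq <- c(pnorm(28, 25, 2.47), pnorm(15.2, 15, 0.2), pnorm(5, 3.46, 4.07))
p <- mean(us[, 1] <= uq[1] & us[, 2] <= uq[2] & us[, 3] <= uq[3])
cat("Copula-based P(temp <= 28, hum <= 15.2, rain <= 5):", p, "\n")
```

Because this data was generated independently, R comes out close to the identity and the copula adds nothing here; its payoff is with real, correlated data, where a t-copula or similar can also capture tail dependence that a Gaussian kernel misses. Queries mixing "above" and "below" conditions follow from the joint CDF by inclusion-exclusion.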

KDE also tends to struggle with heavy-tailed distributions. If your data has extreme values that matter, the density in the tails will be underestimated. Again, copulas are a better tool in that regime.

For exploratory work with moderate data and no extreme tail behavior, though, this is a fast and interpretable approach. The probability box gives you a queryable view of your data’s multivariate structure with very few assumptions — which is often exactly what you want when you’re first trying to understand what’s going on.